Abstract

A central problem in ranking is to design a ranking measure for the evaluation of ranking
functions. In this paper we study, from a theoretical perspective, the widely used Normalized
Discounted Cumulative Gain (NDCG)-type ranking measures. Although there are extensive
empirical studies of NDCG, little is known about its theoretical properties. We first show that,
whatever the ranking function is, the standard NDCG, which adopts a logarithmic discount,
converges to 1 as the number of items to rank goes to infinity. At first sight, this result
is very surprising. It seems to imply that NDCG cannot differentiate good and bad ranking
functions, contradicting the empirical success of NDCG in many applications. To gain a
deeper understanding of ranking measures in general, we propose a notion referred to as
consistent distinguishability. This notion captures the intuition that a ranking measure should
have the following property: for every pair of substantially different ranking functions, the ranking
measure can decide which one is better in a consistent manner on almost all datasets. We
show that NDCG with logarithmic discount has consistent distinguishability although it
converges to the same limit for all ranking functions. We next characterize the set of all feasible
discount functions for NDCG according to the concept of consistent distinguishability.
Specifically, we show that whether NDCG has consistent distinguishability depends on how fast the
discount decays, and $r^{-1}$ is a critical point. We then turn to the cut-off version of NDCG,
i.e., NDCG@k. We analyze the distinguishability of NDCG@k for various choices of k and the
discount functions. Experimental results on real Web search datasets agree well with the theory.



1 Introduction 

Ranking has been extensively studied in information retrieval, machine learning and statistics. It
plays a central role in various applications such as search engines, recommendation systems and expert
finding, to name a few. In many situations one wants to obtain, by learning, a good ranking function
[15], [18], [23]. Thus a fundamental problem is how to design a ranking measure to evaluate the
performance of a ranking function.

Unlike classification and regression, for which there are simple and natural performance measures,
evaluating ranking functions has proved to be more difficult. Suppose there are $n$ objects to rank. A
ranking evaluation measure must induce a total order on the $n!$ possible ranking results. There seem
to be many ways to define ranking measures, and several evaluation measures have been proposed
111, [32], |j, 111, [28]. In fact, as pointed out by some authors, there is no single optimal ranking measure
that works for any application [16].



The focus of this work is the Normalized Discounted Cumulative Gain (NDCG), which is one of
the most popular evaluation measures in Web search [21], [22]. NDCG has two advantages compared
to many other measures. First, NDCG allows each retrieved document to have graded relevance, while
most traditional ranking measures only allow binary relevance. That is, each document is viewed as
either relevant or not relevant by previous ranking measures, while there can be degrees of relevancy
for documents in NDCG. Second, NDCG involves a discount function over the rank, while many other
measures weight all positions uniformly. This feature is particularly important for search engines,
as users care about top-ranked documents much more than others.

The importance of NDCG, as well as other ranking measures, in modern search engines is not
limited to their use as evaluation metrics. Ranking measures are currently also used to guide the design of
ranking functions, thanks to work in the learning to rank area. Although early results in learning
to rank often reduce the ranking problem to classification or regression [15], [18], [23], [26], [5], recently
there is evidence that learning a ranking function by optimizing a ranking measure such as NDCG
is a promising approach [33], [36]. However, using the ranking measure as the objective function to
optimize is computationally intractable. Inspired by approaches in classification, some state-of-the-art
algorithms optimize a surrogate loss instead [8, 35].

In the past few years, there has been rapidly growing interest in studying the consistency of learning
to rank algorithms that optimize surrogate losses. Such studies are motivated by research on the
consistency of surrogate losses for classification [3ja, Ifs, [37], [31], which is a well-established theory
in machine learning. Consistency of ranking is more complicated than that of classification, as there is
more than one possible ranking measure. One needs to study consistency with respect to a specific
ranking measure, that is, whether minimizing the surrogate leads to optimal predictions
according to the risk defined by the given evaluation measure.

The research of consistency for ranking was initiated in [1J, [Tjj. In fact, \\j\ showed that no 
convex surrogate loss can be consistent with the Pairwise Disagreement (PD) measure. This result 
was further generalized in [7], [9], where the non-existence of a consistent convex surrogate loss for Average Precision
and Expected Reciprocal Rank was proved.

In contrast to the above negative results, [27] showed that there do exist NDCG-consistent
surrogates. Furthermore, by using a slightly stronger notion of NDCG consistency, they showed
that any NDCG-consistent surrogate must be a Bregman distance. In a sense, these results mean
that NDCG is a good ranking measure from a learning-to-rank point of view.

NDCG is a normalization of the Discounted Cumulative Gain (DCG) measure. (For formal
definitions of both DCG and NDCG, see Section 2.) DCG is a weighted sum of the degrees
of relevancy of the ranked items. The weight is a decreasing function of the rank (position) of the
object, and is therefore called the discount. The original reason for introducing the discount is that the
probability that a user views a document decreases with its rank. NDCG normalizes
DCG by the Ideal DCG (IDCG), which is simply the DCG measure of the best ranking result.
Thus the NDCG measure is always a number in [0, 1]. Strictly speaking, NDCG is a family of ranking
measures, since there is flexibility in choosing the discount function. The logarithmic discount
$\frac{1}{\log(1+r)}$, where $r$ is the rank, has dominated the literature and applications. We will refer to NDCG
with logarithmic discount as the standard NDCG. Another discount function that has appeared in the literature
is $r^{-1}$, which is called Zipfian in Information Retrieval [24]. Search engine systems also use a cut-off
top-k version of NDCG. That is, the discount is set to zero for ranks larger than k. Such an NDCG
measure is usually referred to as NDCG@k.

Given the importance and popularity of NDCG, there have been extensive studies of this measure,
mainly in the field of Information Retrieval [2], [24], y, [3], [29]. All of this research has been conducted
from an empirical perspective by running experiments on benchmark datasets. Although these works
have yielded insights about NDCG, important issues remain unaddressed. We list a few questions
that naturally arise.



• As pointed out in [16], there has not been any theoretically sound justification for using a
logarithmic ($\frac{1}{\log(1+r)}$) discount other than the fact that it is a smooth decay.

• Is it possible to characterize the class of discount functions that are feasible for NDCG?

• For the standard NDCG@k, the discount is a combination of a very slow logarithmic decay
and a hard cut-off. Why not simply use a smooth discount that decays fast?



In this paper, we study the NDCG-type ranking measures and address the above questions
from a theoretical perspective. The goal of our study is twofold. First, we aim to provide a better
understanding and theoretical justification of NDCG as an evaluation measure. Second, we hope
that our results will shed light on, and be useful for, further research on learning to rank based on
NDCG. Specifically, we analyze the behavior of NDCG as the number of objects to rank grows
large. Asymptotics, including convergence and asymptotic normality, of many traditional ranking
measures have been studied in depth in statistics, especially for Linear Rank Statistics and measures
that are U-statistics [19], [25]. [12] observed that ranking measures such as Area under the ROC
Curve (AUC), P-Norm Push and DCG can be viewed as Conditional Linear Rank Statistics. That
is, conditioned on the relevance degrees of the items, these measures are Linear Rank Statistics
[19]. They show uniform convergence based on an orthogonal decomposition of the measure. The
convergence relies on the fact that the measure can be represented as a (conditional) average of a
fixed score-generating function. Part of our work considers the convergence of NDCG and is closely
related to [12]. However, their results do not apply to our problem, because the score-generating
function for NDCG is not fixed: it changes with the number of objects.

1.1 Our Results 

Our study starts from an analysis of the standard NDCG (i.e., the one using logarithmic discount).
The first discovery is that for every ranking function, the NDCG measure converges to 1 as the
number of items to rank goes to infinity. This result is surprising. At first sight it seems to
mean that the widely used standard NDCG cannot differentiate good and bad ranking systems
when the dataset is large. This problem may be serious because huge datasets are common in
applications such as Web search.

To gain a deeper understanding of NDCG, we first study which properties a
good ranking measure should have. In this paper we propose a notion referred to as consistent
distinguishability, which we believe every ranking measure needs to have. Before giving
the definition of consistent distinguishability, let us look at a motivating example. Suppose we want to
select, from two ranking functions $f_1$, $f_2$, the better one for ranking "sea" images (that is, if an image
contains sea, we hope it is ranked near the top). Since there are billions of sea images on the web, a
commonly used method is to randomly draw, say, a million data points and evaluate the two functions on
them. A crucial assumption underlying this approach is that the evaluation result will be "stable"
on large datasets. That is, if on this randomly drawn dataset $f_1$ is better than $f_2$ according to
the ranking measure, then with high probability over the random draw of another large dataset, $f_1$
should still be better than $f_2$. In other words, $f_1$ is consistently better than $f_2$ according to the
ranking measure.

Our definition of consistent distinguishability captures the above intuition. It requires that
for two substantially different ranking functions, the ranking measure can decide which one is
better consistently on almost all datasets. (See Definition 3 for a formal description.) In a broader
sense, consistent distinguishability is a desirable property for all performance statistics (not only for
ranking). For classification and regression, this property trivially holds because of the simplicity of
the evaluation measures. For ranking, however, things are much more complicated. It is not a priori
clear whether important ranking measures such as NDCG have consistent distinguishability.

Our next main result shows that although the standard NDCG always converges to 1, it can
consistently distinguish every pair of substantially different ranking functions. Therefore, if one
ignores the numerical scaling problem, standard NDCG is a good ranking measure.

We then study NDCG with other possible discounts. We characterize the class of discount
functions that are feasible for NDCG. It turns out that the Zipfian $r^{-1}$ is a critical point. If
a discount function decays slower than $r^{-1}$, the resulting NDCG measure has strong power of
consistent distinguishability. If a discount decays substantially faster than $r^{-1}$, then it does not
have this desired property. Moreover, such ranking measures do not converge as the number of
objects to rank goes to infinity.

Interestingly, this characterization result also provides a better understanding of the cut-off
version NDCG@k. In particular, it gives a theoretical explanation for the earlier question of
why the popular NDCG@k uses a combination of slow logarithmic decay and a hard cut-off as its
discount rather than a smooth discount that decays fast.

Finally, we consider how to choose the cut-off threshold for NDCG@k from the distinguishability
point of view. We analyze the behavior of the measure for various choices of k as well as the discount.
We suggest that choosing k as a certain function of the size of the dataset may be appropriate.

The rest of this paper is organized as follows. Section 2 provides basic notions and definitions.
Section 3 contains the main theorems and key lemmas for the distinguishability theorem. The
experimental results are given in Section 4. All proofs are given in the Appendix.

2 Preliminaries 

Let $\mathcal{X}$ be the instance space, and let $x_1, \ldots, x_n$ ($x_i \in \mathcal{X}$) be $n$ objects to rank. Let $\mathcal{Y}$ be a finite set
of degrees of relevancy. The simplest case is $\mathcal{Y} = \{0, 1\}$, where 0 corresponds to "irrelevant" and 1
corresponds to "relevant". Generally $\mathcal{Y}$ may contain more numbers; for $y \in \mathcal{Y}$, the larger $y$ is,
the more relevant it represents. Let $f$ be a ranking function¹. We assume that $f$ is a mapping from
$\mathcal{X}$ to $\mathbb{R}$. For each object $x \in \mathcal{X}$, $f$ gives it a score $f(x)$. For $n$ objects $x_1, \ldots, x_n$, $f$ ranks them
according to their scores $f(x_1), \ldots, f(x_n)$. The resulting ranking list, denoted by $x^f_{(1)}, \ldots, x^f_{(n)}$,
satisfies $f(x^f_{(1)}) \ge \ldots \ge f(x^f_{(n)})$.

Let $y_1, \ldots, y_n$ ($y_i \in \mathcal{Y}$) be the degrees of relevancy associated with $x_1, \ldots, x_n$. We will denote by
$S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ the set of data to rank. As in the existing literature [18], [13], we assume
that $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d. samples drawn from an underlying distribution $P_{XY}$ over $\mathcal{X} \times \mathcal{Y}$.
Also let $y^f_{(1)}, \ldots, y^f_{(n)}$ be the corresponding relevancy of $x^f_{(1)}, \ldots, x^f_{(n)}$.



1 The ranking function we defined is often called a scoring function in the literature, and ranking function has a more
general definition: for fixed $n$, a general ranking function can be any permutation on $[n]$. However, scoring functions
are used by most search engines. Also, in this paper we study the behavior of the ranking measure of a fixed ranking
function as $n$ grows, so we focus on scoring functions. Note, however, that Theorem 1 and Theorem 6 hold for any sequence
of general ranking functions.



The following is the formal definition of NDCG. Here we give a slightly simplified version tailored 
to our problem. 

Definition 1. Let $D(r)$ ($r \ge 1$) be a discount function. Let $f$ be a ranking function, and $S_n$ be a
dataset. The Discounted Cumulative Gain (DCG) of $f$ on $S_n$ with discount $D$ is defined as²

$$\mathrm{DCG}_D(f, S_n) = \sum_{r=1}^{n} y^f_{(r)} D(r). \tag{1}$$

Let the Ideal DCG, defined as $\mathrm{IDCG}_D(S_n) = \max_{f'} \sum_{r=1}^{n} y^{f'}_{(r)} D(r)$, be the DCG value of the best
ranking function on $S_n$.

The NDCG of $f$ on $S_n$ with discount $D$ is defined as

$$\mathrm{NDCG}_D(f, S_n) = \frac{\mathrm{DCG}_D(f, S_n)}{\mathrm{IDCG}_D(S_n)}. \tag{2}$$

We call NDCG standard if its associated discount function is the inverse logarithm decay $D(r) =
\frac{1}{\log(1+r)}$. Note that the base of the logarithm does not matter for NDCG, since constant scaling will
cancel out due to normalization. We will assume it is the natural logarithm throughout this paper.
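As an illustration of Definition 1, the following is a minimal sketch of DCG, IDCG and NDCG for a scoring function evaluated on a finite dataset. The function names (`log_discount`, `dcg`, `ndcg`) are hypothetical and not part of the paper; the sketch assumes a non-increasing discount, so that sorting the labels in decreasing order attains the ideal DCG, and it returns 0 when all labels are zero, which is a convention chosen here rather than something the definition specifies.

```python
import math


def log_discount(r):
    """Standard discount D(r) = 1 / log(1 + r), natural logarithm."""
    return 1.0 / math.log(1.0 + r)


def dcg(ranked_labels, discount=log_discount):
    """DCG of a list of relevance labels already placed in ranked order (Eq. 1)."""
    return sum(y * discount(r) for r, y in enumerate(ranked_labels, start=1))


def ndcg(scores, labels, discount=log_discount):
    """NDCG of the ranking induced by `scores` on items with `labels` (Eq. 2)."""
    # Rank items by decreasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    # For a non-increasing discount, the ideal ranking sorts labels decreasingly.
    idcg = dcg(sorted(labels, reverse=True), discount)
    # Convention (an assumption of this sketch): NDCG is 0 if no item is relevant.
    return dcg(ranked, discount) / idcg if idcg > 0 else 0.0
```

For example, with three items of which only one is relevant, placing the relevant item last gives `ndcg([1, 2, 3], [1, 0, 0])`, which equals log 2 / log 4 = 0.5 under the standard discount.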

An important property of Eq. (2) is that if a ranking function $f'$ preserves the order of the ranking
function $f$, then $\mathrm{NDCG}_D(f', S_n) = \mathrm{NDCG}_D(f, S_n)$ for all $S_n$. Here by preserving order we mean
that for all $x, x' \in \mathcal{X}$, $f(x) > f(x')$ implies $f'(x) > f'(x')$, and vice versa. Thus the ranking measure
NDCG is not just defined on a single function $f$, but is in fact defined on an equivalence class of ranking
functions that preserve the order of each other.

Below we will frequently use a special ranking function $\bar f$ that preserves the order of $f$.

Definition 2. Let $f$ be a ranking function. We call $\bar f$ the canonical version of $f$, which is defined
as

$$\bar f(x) = \Pr_{X}\left[f(X) \le f(x)\right].$$

The canonical $\bar f$ has the following properties, which can be easily proved from the definition.

Lemma 1. For every ranking function $f$, its canonical version $\bar f$ preserves the order of $f$. In
addition, $\bar f(X)$ has uniform distribution on [0, 1].
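The canonical version in Definition 2 is defined through the underlying distribution, which is unknown in practice. As a rough illustration only, the sketch below uses an empirical plug-in: each score is mapped to the fraction of sample scores that do not exceed it. The name `empirical_canonical` is hypothetical, and the empirical version is an assumption of this sketch, not a construction used in the paper.

```python
import bisect


def empirical_canonical(scores):
    """Map each score f(x_i) to the fraction of sample scores <= f(x_i).

    This is an empirical stand-in for the canonical version of f: it keeps
    the order of the original scores and spreads them over (0, 1].
    """
    n = len(scores)
    sorted_scores = sorted(scores)
    return [bisect.bisect_right(sorted_scores, s) / n for s in scores]
```

When the scores are distinct, the output is a permutation of 1/n, 2/n, ..., 1, which mirrors the uniformity property stated in Lemma 1.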

Finally, we point out that although the discount $D(r)$ is originally defined on positive integers $r$,
below we will often treat $D(r)$ as a function of a real variable. That is, we view $r$ as taking nonnegative
real values. We will also consider the derivative and integral of $D(r)$, denoted by $D'(r)$ and $\int D(r)\,dr$
respectively.

3 Main Results 

In this section, we give the main results of the paper. In Section 3.1 we study the standard NDCG,
i.e., NDCG with logarithmic discount. In Section 3.2 we consider feasible discounts other than the
standard logarithmic one. We analyze the top-k cut-off version NDCG@k in Section 3.3. For clarity,
some of the results in Sections 3.1, 3.2 and 3.3 are given for the simplest case in which the
relevance score is binary. Section 3.4 provides complete results for the general case.



2 Usually DCG is defined as $\mathrm{DCG}_D(f, S_n) = \sum_{r=1}^{n} G(y^f_{(r)}) D(r)$, where $G$ is a monotone increasing function (e.g.,
$G(y) = 2^y - 1$). Here we omit $G$ for notational simplicity. This does not lose any generality, as we can assume that
$y$ changes to $G(y)$.



3.1 Standard NDCG 

To study the behavior of the standard NDCG, we first consider the limit of this measure when
the number of objects to rank goes to infinity. As stated in Section 2, we assume the data are
drawn i.i.d. from some fixed underlying distribution. Surprisingly, it is easy to show that for every
ranking function, standard NDCG converges to 1 almost surely.

Theorem 1. Let $D(r) = \frac{1}{\log(1+r)}$. Then for every ranking function $f$,

$$\mathrm{NDCG}_D(f, S_n) \to 1, \quad \text{a.s.}$$

The proof is given in Appendix [D] 
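To give a feel for Theorem 1 on finite data, the following sketch computes the standard NDCG of the worst possible ranking in a deterministic setting: exactly half of the $n$ items are relevant and all of them are placed at the bottom. The setting and the function name `worst_case_standard_ndcg` are assumptions made for illustration; this is not the i.i.d. setting of the theorem, only a simple case where the values can be computed exactly.

```python
import math


def worst_case_standard_ndcg(n):
    """Standard NDCG of the worst ranking when n // 2 of n items are relevant.

    Labels are binary; the worst ranking puts every relevant item in the
    bottom n // 2 positions, while the ideal ranking puts them at the top.
    """
    half = n // 2
    weights = [1.0 / math.log(1.0 + r) for r in range(1, n + 1)]
    idcg = sum(weights[:half])       # relevant items at ranks 1..half
    dcg = sum(weights[n - half:])    # relevant items at the last `half` ranks
    return dcg / idcg


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(worst_case_standard_ndcg(n), 3))
```

Even for this worst ranking the value increases with $n$, but slowly, which is in line with the remark that the convergence to 1 raises a numerical scaling issue rather than a loss of distinguishability.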

At first glance, the above result is quite negative for standard NDCG. It seems to say that in
the limiting case, standard NDCG cannot differentiate ranking functions. However, Theorem 1 only
considers the limits. To have a better understanding of NDCG, we need to make a deeper analysis
of its power of distinguishability. In particular, Theorem 1 does not rule out the possibility that the
standard NDCG can consistently distinguish substantially different ranking functions. Below we
give the formal definition of two ranking functions being consistently distinguishable by a ranking
measure $\mathcal{M}$.

Definition 3. Let $(x_1, y_1), (x_2, y_2), \ldots$ be i.i.d. instance-label pairs drawn from the underlying
distribution $P_{XY}$ over $\mathcal{X} \times \mathcal{Y}$. Let $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. A pair of ranking functions $f_0$, $f_1$ is
said to be consistently distinguishable by a ranking measure $\mathcal{M}$ if there exist a negligible function³
$\mathrm{neg}(N)$ and $b \in \{0, 1\}$ such that for every sufficiently large $N$, with probability $1 - \mathrm{neg}(N)$,

$$\mathcal{M}(f_b, S_n) > \mathcal{M}(f_{1-b}, S_n)$$

holds for all $n \ge N$ simultaneously.
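Definition 3 is a probabilistic statement about an infinite data sequence, so it cannot be verified on a finite sample. The sketch below only checks the inner event on one finite sequence: whether one scoring function has strictly larger standard NDCG than the other on every prefix $S_n$ from some `n_min` onward. The names `ndcg` and `wins_on_all_prefixes` are hypothetical, and the zero-IDCG convention is an assumption of this sketch.

```python
import math


def ndcg(scores, labels):
    """Standard NDCG (discount 1 / log(1 + r)) of the ranking induced by scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    disc = [1.0 / math.log(1.0 + r) for r in range(1, len(scores) + 1)]
    dcg = sum(labels[i] * d for i, d in zip(order, disc))
    idcg = sum(y * d for y, d in zip(sorted(labels, reverse=True), disc))
    return dcg / idcg if idcg > 0 else 0.0


def wins_on_all_prefixes(scores_b, scores_other, labels, n_min):
    """True if f_b beats the other function on every prefix S_n with n >= n_min.

    scores_b[i] and scores_other[i] are the two functions' scores for item i,
    and labels[i] is the relevance of item i, in the order the data arrive.
    """
    return all(
        ndcg(scores_b[:n], labels[:n]) > ndcg(scores_other[:n], labels[:n])
        for n in range(n_min, len(labels) + 1)
    )
```

A possible use is to run this check on many independently drawn sequences and look at how the fraction of successes behaves as `n_min` grows; that empirical frequency is only a proxy for the probability in the definition.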

Consistent distinguishability is appealing. One would like a ranking measure $\mathcal{M}$ to have the
property that every two substantially different ranking functions are consistently distinguishable by
$\mathcal{M}$. The next theorem shows that standard NDCG does have this desired property. For clarity,
here we state the theorem for the simple binary relevance case, i.e., $\mathcal{Y} = \{0, 1\}$. It is easy to extend
the result to the general case in which $\mathcal{Y}$ is any finite set.

Theorem 2. For every pair of ranking functions $f_0, f_1$, let $y^{f_i}(s) = \Pr[Y = 1 \mid \bar f_i(X) = s]$, $i =
0, 1$. Assume $y^{f_0}(s)$ and $y^{f_1}(s)$ are Hölder continuous in $s$. Then, unless $y^{f_0}(s) = y^{f_1}(s)$ almost
everywhere on [0, 1], $f_0$ and $f_1$ are consistently distinguishable by standard NDCG.

The proof is given in Appendix [A]

Theorem 2 provides theoretical justification for using standard NDCG as a ranking measure,
and answers the first question raised in the Introduction. Although standard NDCG converges to
the same limit for all ranking functions, it is still a good ranking measure with strong consistent
distinguishability (if we ignore the numerical scaling issue).

3.2 Characterization of Feasible Discount Functions 

In the previous section we demonstrated that standard NDCG is a good ranking measure. In both
the literature and real applications, standard NDCG is dominant. However, there is no known
theoretical evidence that the logarithmic function is the only feasible discount, or that it is the optimal one. In
this subsection, we investigate other discount functions. We study the asymptotic behavior and
distinguishability of the induced NDCG measures and compare them to the standard NDCG. Finally, we
characterize the class of discount functions which we think are feasible for NDCG. For the sake
of clarity, the results in this subsection are given for the simplest case $\mathcal{Y} = \{0, 1\}$. Complete
results will be given in Section 3.4.

3 A negligible function $\mathrm{neg}(N)$ means that for every $c$, $\mathrm{neg}(N) < N^{-c}$ for sufficiently large $N$.

Standard NDCG utilizes the logarithmic discount, which decays slowly. In the following we first
consider a discount that decays a little faster. Specifically, we consider $D(r) = r^{-\beta}$ ($0 < \beta < 1$). Let
us first investigate the limit of the ranking measure as the number of objects goes to infinity.

Theorem 3. Assume $D(r) = r^{-\beta}$ where $\beta \in (0, 1)$. Assume also $p = \Pr[Y = 1] > 0$ and
$y^f(s) = \Pr[Y = 1 \mid \bar f(X) = s]$ is a continuous function. Then

$$\mathrm{NDCG}_D(f, S_n) \xrightarrow{p} \frac{1-\beta}{p^{1-\beta}} \int_0^1 y^f(s)\,(1-s)^{-\beta}\,ds. \tag{3}$$

The proof will be given in Appendix [D]



For $D(r) = r^{-\beta}$ ($\beta \in (0, 1)$), NDCG no longer converges to the same limit for all ranking
functions. The limit is actually a correlation between $y^f(s)$ and $(1-s)^{-\beta}$. For a good ranking
function $f$, $y^f(s) = \Pr[Y = 1 \mid \bar f(X) = s]$ is likely to be an increasing function of $s$, and thus has
positive correlation with $(1-s)^{-\beta}$. Therefore, the limit of the ranking measure already differentiates
good and bad ranking functions to some extent.
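To make the limit in Theorem 3 concrete, the sketch below evaluates $\frac{1-\beta}{p^{1-\beta}} \int_0^1 y^f(s)(1-s)^{-\beta}\,ds$ numerically for a user-supplied relevance profile $y^f$. It assumes that scores are in canonical form (uniform on [0, 1]), so that $p = \int_0^1 y^f(s)\,ds$. To avoid the integrable singularity at $s = 1$, it uses the substitution $u = (1-s)^{1-\beta}$, under which the limit becomes $p^{\beta-1} \int_0^1 y^f(1 - u^{1/(1-\beta)})\,du$. The function name and the midpoint rule are choices made for this illustration.

```python
def poly_ndcg_limit(y_f, p, beta, grid=100_000):
    """Numerically evaluate the Theorem 3 limit for discount r^(-beta).

    y_f  : function s -> Pr[Y = 1 | canonical score = s], for s in [0, 1]
    p    : Pr[Y = 1], assumed to equal the integral of y_f over [0, 1]
    beta : discount exponent in (0, 1)
    """
    exponent = 1.0 / (1.0 - beta)
    total = 0.0
    # Midpoint rule on u in [0, 1] after substituting u = (1 - s)^(1 - beta).
    for i in range(grid):
        u = (i + 0.5) / grid
        total += y_f(1.0 - u ** exponent)
    return (total / grid) * p ** (beta - 1.0)


if __name__ == "__main__":
    beta, p = 0.5, 0.5
    print(poly_ndcg_limit(lambda s: p, p, beta))                        # uninformative scores
    print(poly_ndcg_limit(lambda s: s, p, beta))                        # relevance rises with score
    print(poly_ndcg_limit(lambda s: 1.0 if s > 1 - p else 0.0, p, beta))  # perfect ranking
```

With $\beta = p = 1/2$, the three profiles give limits of approximately $p^{\beta} \approx 0.707$, $0.943$ and $1$, so better ranking functions are separated already at the level of the limit.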

We next study whether NDCG with polynomial discount has power of distinguishability as strong
as the standard NDCG. That is, we will see if Theorem 2 holds for NDCG with $r^{-\beta}$ ($\beta \in (0, 1)$).

Theorem 4. Let $D(r) = r^{-\beta}$, $\beta \in (0, 1)$. Assume $p = \Pr[Y = 1] > 0$. For every pair of ranking
functions $f_0$, $f_1$, denote $y^{f_i}(s) = \Pr[Y = 1 \mid \bar f_i(X) = s]$, $i = 0, 1$, and $\Delta y(s) = y^{f_0}(s) - y^{f_1}(s)$.
Suppose at least one of the following two conditions holds: 1) $\int_0^1 \Delta y(s)(1-s)^{-\beta}\,ds \ne 0$; 2) $y^{f_0}(s)$,
$y^{f_1}(s)$ are Hölder continuous with Hölder continuity constant $\alpha$ satisfying $\alpha > 3(1-\beta)$, and $\Delta y(1) \ne
0$. Then $f_0$ and $f_1$ are strictly distinguishable with high probability by NDCG with discount $D(r)$.

The proof will be given in Appendix [E] 

Theorem 4 involves two conditions. Satisfying either of them leads to strict distinguishability
with high probability. The first condition simply means that $\mathrm{NDCG}_D(f_0, S_n)$ and $\mathrm{NDCG}_D(f_1, S_n)$
converge to different limits, and therefore the two functions are consistently distinguishable in the
strongest sense. The second condition deals with the case in which $\mathrm{NDCG}_D(f_0, S_n)$ and $\mathrm{NDCG}_D(f_1, S_n)$
converge to the same limit. Comparing the distinguishability of NDCG with $r^{-\beta}$ discount with that of
the standard NDCG, in most cases the $r^{-\beta}$ discount has stronger distinguishability than standard
NDCG (i.e., when the measures of two ranking functions converge to different limits). On the other
hand, if we consider the worst case, standard NDCG is better, because it requires fewer conditions
for consistent distinguishability.

We next study the Zipfian discount $D(r) = r^{-1}$. The following theorem describes the limit of
the ranking measure.

Theorem 5. Assume $D(r) = r^{-1}$. Assume also $p = \Pr[Y = 1] > 0$ and $y^f(s) = \Pr[Y = 1 \mid \bar f(X) =
s]$ is a continuous function. Then

$$\mathrm{NDCG}_D(f, S_n) \xrightarrow{p} \Pr[Y = 1 \mid \bar f(X) = 1]. \tag{4}$$

The proof of Theorem [5] will be given in Appendix [Pi 

The limit of NDCG with Zipfian discount depends only on the performance of the ranking 
function for the top ranks. The relevancy of lower ranked items does not affect the limit. 



The next logical step would be to analyze the power of distinguishability of NDCG with Zipfian
discount. However, we are not able to prove that consistent distinguishability holds for this ranking
measure. The techniques developed for the distinguishability theorems given above do not apply to
the Zipfian discount. Although we cannot disprove its distinguishability, we suspect that the Zipfian
discount does not have strong consistent distinguishability power.

Finally, we consider discount functions that decay substantially faster than $r^{-1}$. We will show
that with such discounts, NDCG does not converge as the number of objects tends to infinity. More
importantly, such NDCG does not have the desired consistent distinguishability property.

Theorem 6. Let $\mathcal{X}$ be the instance space. For any $x \in \mathcal{X}$, let $y^*_x = \arg\max_{y \in \mathcal{Y}} \Pr(Y = y \mid X = x)$.
Assume that there is an absolute constant $\delta > 0$ such that for every $x \in \mathcal{X}$, $\Pr(Y = y \mid X = x) \ge
\delta \cdot \Pr(Y = y^*_x \mid X = x)$ for all $y \in \mathcal{Y}$. If $\sum_{r=1}^{\infty} D(r) \le B$ for some constant $B > 0$, then $\mathrm{NDCG}_D(f, S_n)$
does not converge in probability for any ranking function $f$. In particular, if $D(r) \le r^{-(1+\epsilon)}$ for
some $\epsilon > 0$, $\mathrm{NDCG}_D(f, S_n)$ does not converge. Moreover, no pair of ranking functions is
consistently distinguishable by NDCG with such a discount.

The proof is given in Appendix [D] 
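One informal way to see why summable discounts behave so differently is to look at how much of the total discount weight sits on the first few ranks. The sketch below computes that fraction for a given discount; the function name `top_mass_fraction` and the specific choices $m = 10$, $n = 10^5$ are assumptions made for illustration, and the computation is a heuristic picture rather than part of the proof.

```python
import math


def top_mass_fraction(discount, m, n):
    """Fraction of the total discount weight over ranks 1..n carried by ranks 1..m."""
    total = sum(discount(r) for r in range(1, n + 1))
    top = sum(discount(r) for r in range(1, m + 1))
    return top / total


if __name__ == "__main__":
    n = 100_000
    # Summable discount r^(-2): the top 10 ranks carry most of the weight.
    print(top_mass_fraction(lambda r: r ** -2.0, 10, n))
    # Logarithmic discount: the top 10 ranks carry a tiny share of the weight.
    print(top_mass_fraction(lambda r: 1.0 / math.log(1.0 + r), 10, n))
```

With a summable discount, the measure is dominated by the labels of a bounded number of top items, whose randomness does not average out as $n$ grows; with the logarithmic discount the weight is spread over many ranks.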



Now we are able to characterize the feasible discounts for NDCG according to the results given
so far. The logarithmic $\frac{1}{\log(1+r)}$ and polynomial $r^{-\beta}$ ($\beta \in (0, 1)$) are feasible discount functions for
NDCG. For different ranking functions, standard NDCG converges to the same limit, while the $r^{-\beta}$
($\beta \in (0, 1)$) one converges to different limits in most cases. However, if we ignore the numerical
scaling issue, both the logarithmic and the $r^{-\beta}$ ($\beta \in (0, 1)$) discounts have consistent distinguishability.
The Zipfian $r^{-1}$ discount is on the borderline. It is not clear whether it has strong power of
distinguishability. A discount that decays faster than $r^{-(1+\epsilon)}$ for some $\epsilon > 0$ is not appropriate for
NDCG when the data size is large.

3.3 Cut-off Versions of NDCG 

In this section we study the top-k version of NDCG, i.e., NDCG@k. For NDCG@k, the discount
function is set as $D(r) = 0$ for all $r > k$. The motivation for using NDCG@k is to pay more
attention to the top-ranked results. The logarithmic discount is also dominant for NDCG@k. We will
call this measure standard NDCG@k. As already stated in the Introduction, a natural question about
standard NDCG@k is why it uses a combination of a very slow logarithmic decay and a hard cut-off as
the discount function. Why not simply use a smooth discount with fast decay, which seems more
natural? In fact, this question has already been answered by Theorem 6: NDCG with such a discount
does not have strong power of distinguishability.
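A minimal sketch of standard NDCG@k as described above follows: the logarithmic discount is used for ranks up to $k$ and zero beyond, and the ideal DCG is truncated in the same way. The function name `ndcg_at_k` is hypothetical, and returning 0 when the truncated ideal DCG is zero is a convention chosen for this sketch.

```python
import math


def ndcg_at_k(scores, labels, k):
    """Standard NDCG@k: discount 1 / log(1 + r) for r <= k and 0 for r > k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order][:k]          # only the top k ranks count
    ideal = sorted(labels, reverse=True)[:k]         # best possible top k
    dcg = sum(y / math.log(1.0 + r) for r, y in enumerate(ranked, start=1))
    idcg = sum(y / math.log(1.0 + r) for r, y in enumerate(ideal, start=1))
    # Convention (an assumption of this sketch): 0 if nothing relevant can be shown.
    return dcg / idcg if idcg > 0 else 0.0
```

Setting `k` to the number of items recovers the full standard NDCG of Definition 1.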

We next address the issue of how to choose the cut-off threshold k. It is obvious that setting k
to a constant independent of $n$ is not appropriate, because the partial sum of the discount is then bounded
and, according to Theorem 6, the ranking measure does not converge. So k must grow unboundedly
as $n$ goes to infinity. Below we investigate the convergence and distinguishability of NDCG@k for
various choices of k and the discount function. For clarity we assume here $\mathcal{Y} = \{0, 1\}$;
general results will be given in Section 3.4. The proofs of all theorems in this section will be given
in Appendix [Pi We first consider the case $k = o(n)$.

Theorem 7. Let $\mathcal{Y} = \{0, 1\}$. Assume $D(r)$ is a discount function and $\sum_{r=1}^{\infty} D(r)$ is unbounded.
Suppose $k = o(n)$ and $k \to \infty$ as $n \to \infty$. Let $\tilde D(r) = D(r)$ for all $r \le k$ and $\tilde D(r) = 0$ for all $r > k$.
Assume also that $p = \Pr[Y = 1] > 0$ and $y^f(s) = \Pr[Y = 1 \mid \bar f(X) = s]$ is a continuous function.
Then

$$\mathrm{NDCG}_{\tilde D}(f, S_n) \xrightarrow{p} \Pr[Y = 1 \mid \bar f(X) = 1]. \tag{5}$$



The limit of NDCG@k where $k = o(n)$ is exactly the same as that of NDCG with Zipfian discount. Also,
as with the Zipfian discount, the distinguishability power of this NDCG@k measure is not clear.

We next consider the case $k = cn$ for some constant $c \in (0, 1)$. We study the standard logarithmic
and the polynomial discount respectively in the following two theorems.

Theorem 8. Assume $D(r) = \frac{1}{\log(1+r)}$ and $\mathcal{Y} = \{0, 1\}$. Let $k = cn$ for some constant $c \in (0, 1)$.
Define the cut-off discount function $\tilde D$ as $\tilde D(r) = D(r)$ if $r \le k$ and $\tilde D(r) = 0$ otherwise. Assume
also $p = \Pr[Y = 1] > 0$ and $y^f(s) = \Pr[Y = 1 \mid \bar f(X) = s]$ is a continuous function. Then

$$\mathrm{NDCG}_{\tilde D}(f, S_n) \xrightarrow{p} \frac{c}{\min\{c, p\}} \cdot \Pr[Y = 1 \mid \bar f(X) \ge 1 - c]. \tag{6}$$
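As a rough sanity check of the limit $\frac{c}{\min\{c,p\}} \Pr[Y = 1 \mid \bar f(X) \ge 1 - c]$ in Theorem 8, the sketch below uses a deterministic stand-in for an uninformative ranking: the ranked labels alternate 1, 0, 1, 0, ..., so that $p = 1/2$ and the fraction of relevant items in any top segment is about 1/2. This alternating construction and the function names are assumptions of the illustration, not the i.i.d. setting of the theorem. Because the logarithmic discount varies slowly, the finite-$n$ value approaches the limit only at a logarithmic rate.

```python
import math


def alternating_ndcg_at_cn(n, c):
    """Standard NDCG@k with k = c * n for the ranked label list 1, 0, 1, 0, ..."""
    k = round(c * n)
    weights = [1.0 / math.log(1.0 + r) for r in range(1, k + 1)]
    # Relevant items sit at odd ranks; only ranks <= k contribute.
    dcg = sum(w for r, w in enumerate(weights, start=1) if r % 2 == 1)
    n_relevant = (n + 1) // 2
    # The ideal ranking fills the top min(k, n_relevant) ranks with relevant items.
    idcg = sum(weights[:min(k, n_relevant)])
    return dcg / idcg


def theorem8_limit(c, p, precision_top_c):
    """Limit (c / min{c, p}) * Pr[Y = 1 | canonical score >= 1 - c]."""
    return c / min(c, p) * precision_top_c


if __name__ == "__main__":
    n = 100_000
    for c in (0.25, 0.8):
        print(c, alternating_ndcg_at_cn(n, c), theorem8_limit(c, 0.5, 0.5))
```

For $c < p$ the two numbers agree closely already at moderate $n$; for $c > p$ the finite-$n$ value is still visibly below the limit at $n = 10^5$, reflecting the slow convergence.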

Theorem 9. Assume $D(r) = r^{-\beta}$ and $\mathcal{Y} = \{0, 1\}$, where $\beta \in (0, 1)$. Let $k = cn$ for some constant
$c \in (0, 1)$. Define the cut-off discount function $\tilde D(r) = D(r)$ if $r \le k$ and $\tilde D(r) = 0$ otherwise.
Assume also $p = \Pr[Y = 1] > 0$ and $y^f(s) = \Pr[Y = 1 \mid \bar f(X) = s]$ is a continuous function. Then

$$\mathrm{NDCG}_{\tilde D}(f, S_n) \xrightarrow{p} \frac{1-\beta}{(\min\{c, p\})^{1-\beta}} \int_{1-c}^{1} y^f(s)\,(1-s)^{-\beta}\,ds. \tag{7}$$



The consistent distinguishability of the two measures considered in Theorem 8 and Theorem 9
is similar to that of their corresponding full NDCG measures. To be precise, for NDCG@k ($k = cn$)
with logarithmic discount and NDCG@k with $r^{-\beta}$ ($\beta \in (0, 1)$) discount, consistent distinguishability
holds under the conditions given in Theorem 2 and Theorem 4 respectively. Hence these two cut-off
versions of NDCG are feasible ranking measures.

3.4 Results for General y 

Some theorems given so far assume $\mathcal{Y} = \{0, 1\}$. Here we give complete results for the general case
that $|\mathcal{Y}| > 2$, and $\mathcal{Y} = \{\eta_1, \ldots, \eta_{|\mathcal{Y}|}\}$. We only state the theorems and omit the proofs, which are
straightforward modifications of the special case $\mathcal{Y} = \{0, 1\}$. The case $D(r) = \frac{1}{\log(1+r)}$ has already
been covered by Theorem 1: the measure always converges to 1 whatever the ranking function is. We next
consider $r^{-\beta}$ decay.

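Theorem 10. Assume $D(r) = r^{-\beta}$ with $\beta \in (0, 1)$. Suppose that $\mathcal{Y} = \{\eta_1, \ldots, \eta_{|\mathcal{Y}|}\}$, where $\eta_1 >
\ldots > \eta_{|\mathcal{Y}|}$. Assume $f(X) \in [a, b]$; $f(X)$ has a probability density function such that $P(f(X) = s) > 0$
for all $s \in [a, b]$; $\Pr(Y = \eta_j) > 0$ and $\Pr(Y = \eta_j \mid f(X) = s)$ is a continuous function of $s$ for all $j$.
Then

$$\mathrm{NDCG}_D(f, S_n) \xrightarrow{p} \frac{(1-\beta) \int_0^1 E[Y \mid \bar f(X) = s]\,(1-s)^{-\beta}\,ds}{\sum_{j=1}^{|\mathcal{Y}|} \eta_j \left(R_j^{1-\beta} - R_{j-1}^{1-\beta}\right)},$$

where $R_0 = 0$; $R_j = \Pr(Y \ge \eta_j)$.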

The next theorem is for top-k type NDCG measures, where k = o(n). 

Theorem 11. Suppose that $\mathcal{Y} = \{\eta_1, \ldots, \eta_{|\mathcal{Y}|}\}$, where $\eta_1 > \ldots > \eta_{|\mathcal{Y}|}$. Assume $\sum_{r=1}^{n} D(r)$ and $k$ grow
unboundedly and $k/n = o(1)$. For any $n$, let $\tilde D(r) = D(r)$ if $r \le k$ and $\tilde D(r) = 0$ otherwise. Assume
$f(X) \in [a, b]$; $f(X)$ has a probability density function such that $P(f(X) = s) > 0$ for all $s \in [a, b]$;
$\Pr(Y = \eta_j) > 0$ and $\Pr(Y = \eta_j \mid f(X) = s)$ is a continuous function of $s$ for all $j$. Then

$$\mathrm{NDCG}_{\tilde D}(f, S_n) \xrightarrow{p} \frac{1}{\eta_1} \cdot E[Y \mid \bar f(X) = 1].$$



The last two theorems are for top-k, where $k/n = c$. We consider the logarithmic discount and the
polynomial discount separately.

Theorem 12. Suppose that $\mathcal{Y} = \{\eta_1, \ldots, \eta_{|\mathcal{Y}|}\}$, where $\eta_1 > \ldots > \eta_{|\mathcal{Y}|}$. Let $k/n = c$ for some
constant $c > 0$. Let $D(r) = \frac{1}{\log(1+r)}$. For any $n$, let $\tilde D(r) = D(r)$ if $r \le k$ and $\tilde D(r) = 0$ otherwise.
Assume $f(X) \in [a, b]$; $f(X)$ has a probability density function such that $P(f(X) = s) > 0$ for all
$s \in [a, b]$; $\Pr(Y = \eta_j) > 0$ and $\Pr(Y = \eta_j \mid f(X) = s)$ is a continuous function of $s$ for all $j$. Then

$$\mathrm{NDCG}_{\tilde D}(f, S_n) \xrightarrow{p} \frac{c \cdot E[Y \mid \bar f(X) \ge 1 - c]}{\sum_{j=1}^{t} \eta_j (R_j - R_{j-1}) + \eta_{t+1} (c - R_t)},$$

where $R_0 = 0$; $R_j = \Pr(Y \ge \eta_j)$; $t$ is defined by $R_t < c < R_{t+1}$.

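Theorem 13. Let $D(r) = r^{-\beta}$ with $\beta \in (0, 1)$, and $\tilde D(r) = D(r)$ if $r \le k$ and $\tilde D(r) = 0$ otherwise.
Using the same notation and under the same conditions as in Theorem 12,

$$\mathrm{NDCG}_{\tilde D}(f, S_n) \xrightarrow{p} \frac{(1-\beta) \int_{1-c}^{1} E[Y \mid \bar f(X) = s]\,(1-s)^{-\beta}\,ds}{\sum_{j=1}^{t} \eta_j \left(R_j^{1-\beta} - R_{j-1}^{1-\beta}\right) + \eta_{t+1} \left(c^{1-\beta} - R_t^{1-\beta}\right)}.$$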
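The normalizing term for graded relevance with a polynomial discount and cut-off $k = cn$ can be checked numerically. The sketch below computes $\sum_{j \le t} \eta_j (R_j^{1-\beta} - R_{j-1}^{1-\beta}) + \eta_{t+1}(c^{1-\beta} - R_t^{1-\beta})$, where $R_j = \Pr(Y \ge \eta_j)$ and $t$ is the last grade whose cumulative mass stays below $c$, and compares it with the ideal DCG@k of a finite label multiset rescaled by $(1-\beta)/n^{1-\beta}$. This expression is what a direct computation of the ideal DCG gives; the function names and the example grade distribution are assumptions made for illustration.

```python
def ideal_normalizer(etas, probs, c, beta):
    """Limiting ideal-DCG normalizer for discount r^(-beta) cut off at k = c * n.

    etas  : relevance grades in strictly decreasing order
    probs : probs[j] = Pr(Y = etas[j])
    """
    e = 1.0 - beta
    total, r_prev = 0.0, 0.0
    for eta, q in zip(etas, probs):
        r_next = min(r_prev + q, c)      # cumulative mass R_j, truncated at c
        total += eta * (r_next ** e - r_prev ** e)
        r_prev = r_next
        if r_prev >= c:                  # the cut-off has been reached
            break
    return total


def scaled_idcg_at_k(labels, c, beta):
    """(1 - beta) / n^(1 - beta) times the ideal DCG@k of a finite label list."""
    n = len(labels)
    k = round(c * n)
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(y * r ** -beta for r, y in enumerate(ideal, start=1))
    return (1.0 - beta) * idcg / n ** (1.0 - beta)


if __name__ == "__main__":
    labels = [2] * 10_000 + [1] * 30_000 + [0] * 60_000
    print(scaled_idcg_at_k(labels, 0.25, 0.5))
    print(ideal_normalizer([2, 1, 0], [0.1, 0.3, 0.6], 0.25, 0.5))
```

The two values agree up to a correction that vanishes as $n$ grows, which is the sense in which the sum above plays the role of the ideal DCG in the limit.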

4 Experimental Results 

All theoretical results in this paper are proved under the assumption that the objects to rank are
i.i.d. data. In real applications the data are often not strictly i.i.d., or not even random. Here we
conduct experiments on a real dataset of Web search data. The aim is to see to what extent the
behavior of the ranking measures on real datasets agrees with our theory obtained under the i.i.d.
assumption.

The dataset we use contains click-through log data of a search engine. We collected the clicked
documents for 40 popular queries as the test set; these are regarded as 40 independent ranking tasks.
In each task, there are 5000 Web documents with clicks. To avoid the heavy work of human labeling, we
simply label each document by its click count according to the following rule. We assign relevance
y = 2 to documents with more than 1000 clicks, 1 to those with 100 to 1000 clicks, and 0 to the rest.
In each task, we extracted 40 features for each item representing its relevance to the given query. A
detail is how to construct $S_n$. In our theoretical analysis we assume $S_n$ contains i.i.d. data. Since
the goal of the experiments is to see how our theory works for real applications, we construct $S_n$ as
follows. For each query, there are 5000 documents in total, which we denote by $x_1, \ldots, x_{5000}$. Assume
each document has a generating time. Without loss of generality we assume $x_1$ was generated earliest
and $x_{5000}$ latest. We set $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ for each $1 \le n \le 5000$. Such a construction
simulates the fact that, in reality, the number of documents a search engine needs to rank may
increase over time. We use three ranking functions in the experiments: a trained RankSVM model
[20], a trained ListNet model [10], and a function chosen randomly. To be concrete, the random
function is constructed as follows. For each $x \in \mathcal{X}$, we set $f(x)$ by choosing a number uniformly
at random from $[-1, 1]$. For the trained models (i.e., ListNet and RankSVM), parameters are learned
from a separate large training set constructed in the same manner as the test set. Clearly, ListNet
and RankSVM are relatively good ranking functions and the random function is bad.
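The labeling rule, the random ranking function and the nested datasets $S_n$ described above can be sketched as follows. This is only an illustration of the setup as stated in the text; the helper names (`relevance_from_clicks`, `make_random_ranker`, `prefix_datasets`) are hypothetical, and the trained RankSVM and ListNet models are not reproduced here.

```python
import random


def relevance_from_clicks(clicks):
    """Map a click count to the graded label: 2 (> 1000), 1 (100 to 1000), 0 (rest)."""
    if clicks > 1000:
        return 2
    if clicks >= 100:
        return 1
    return 0


def make_random_ranker(seed=0):
    """Return a scoring function drawing each score uniformly from [-1, 1].

    The score ignores the document itself, so each document should be scored once.
    """
    rng = random.Random(seed)
    return lambda doc: rng.uniform(-1.0, 1.0)


def prefix_datasets(docs, labels):
    """Yield S_n = the first n (document, label) pairs, for n = 1, ..., len(docs).

    Documents are assumed to be ordered by generating time, earliest first.
    """
    pairs = list(zip(docs, labels))
    for n in range(1, len(pairs) + 1):
        yield pairs[:n]
```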

We analyze the following typical NDCG type ranking measures by experiments: 

• Standard NDCG: $D(r) = \frac{1}{\log(1+r)}$. See Figure 1, Theorem 1 and Theorem 2.

• NDCG with a feasible discount function: $D(r) = r^{-1/2}$. See Figure 2, Theorem 3 and Theorem 4.

• NDCG with too fast decay: $D(r) = 2^{-r}$. See Figure 3 and Theorem 6.

• NDCG@k: $k = n/5$; $D(r) = \frac{1}{\log(1+r)}$. See Figure 4 and Theorem 8.

Figure 1: Standard NDCG: converges to the same limit but distinguishes the ranking functions well.

Figure 2: NDCG with feasible discount $D(r) = r^{-1/2}$: converges to different limits and distinguishes
the ranking functions well.

Figure 3: NDCG with too fast decay $D(r) = 2^{-r}$: does not converge; does not have good
distinguishability power either.

Figure 4: NDCG@k ($D(r) = \frac{1}{\log(1+r)}$, $k = n/5$): distinguishes the ranking functions well.

Figure 1 agrees well with Theorem 1 and Theorem 2. On the one hand, the NDCG measures
of the three ranking functions are very close and seem to converge to the same limit. On the other
hand, one can see from the enlarged part of the figure (where the vertical axis is magnified and
stretched) that the measures in fact distinguish the ranking functions well.
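For concreteness, the discounts compared in this section and the resulting measures can be sketched as below. This is a minimal illustration rather than the code used for the experiments; it assumes the gain of an item is its label $y$ itself, and the function names are chosen for this example only.

```python
import math


def log_discount(r):
    return 1.0 / math.log(1.0 + r)   # standard NDCG


def sqrt_discount(r):
    return r ** -0.5                 # feasible polynomial discount r^(-1/2)


def exp_discount(r):
    return 2.0 ** -r                 # discount that decays too fast


def ndcg(ranked_labels, discount, k=None):
    """NDCG of labels listed in ranked order (highest-scored item first).

    When k is given, only the top k positions contribute (NDCG@k).
    """
    n = len(ranked_labels)
    k = n if k is None else min(k, n)
    dcg = sum(y * discount(r) for r, y in enumerate(ranked_labels[:k], start=1))
    ideal_order = sorted(ranked_labels, reverse=True)
    ideal = sum(y * discount(r) for r, y in enumerate(ideal_order[:k], start=1))
    return dcg / ideal if ideal > 0 else 0.0
```

For example, `ndcg(labels, log_discount, k=len(labels) // 5)` corresponds to the NDCG@k setting with $k = n/5$.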

Figure 2 demonstrates the result of NDCG with the feasible discount $r^{-1/2}$. In this experiment,
it seems that the ranking measures of the three ranking functions converge to different limits and
therefore distinguish them very well. In our experimental setting, it is not easy to find two ranking
functions whose NDCG measures converge to the same limit. If one could find such a pair of ranking
functions, it would be interesting to see how well the measure distinguishes them.

Figure 3 shows the behavior of NDCG with a smooth discount which decays too fast. The
measure cannot distinguish the three ranking functions very well. Even the randomly chosen function
has an NDCG score similar to those of RankSVM and ListNet. The figure also suggests that
the measures do not converge.

Figure 4 depicts the result of NDCG@k, where k is a constant proportion of n. Before describing
the result, let us first compare Theorem 8 and Theorem 1. Note that although both discounts are
logarithmic, NDCG@k for k = cn can converge to different limits for different ranking
functions, while standard NDCG always converges to 1. Figure 4 clearly demonstrates this result.
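The contrast between the two logarithmic measures can be illustrated with a small calculation for a purely random ranking. The sketch below is a simplification and not part of the experiments: it assumes binary labels with exactly `round(n * p)` relevant items, so that each position holds a relevant item with probability about `p`, and it compares the expected DCG with the ideal DCG. The function name `random_ranker_ratio` is hypothetical.

```python
import math


def random_ranker_ratio(n, p, c=None):
    """Approximate E[DCG] / ideal DCG of a random ranking under D(r) = 1/log(1+r).

    With c set, positions beyond k = int(c * n) are ignored (NDCG@k with k = cn).
    """
    k = n if c is None else int(c * n)
    relevant = round(n * p)
    m = min(k, relevant)          # number of positions filled in the ideal ranking
    if m < 1:
        raise ValueError("need at least one relevant item within the cut-off")
    total, ideal = 0.0, 0.0
    for r in range(1, k + 1):
        total += 1.0 / math.log(1.0 + r)
        if r == m:
            ideal = total         # ideal DCG: relevant items occupy ranks 1..m
    return p * total / ideal      # expected DCG is p * sum of discounts


sizes = (10**2, 10**4, 10**6)
full = [random_ranker_ratio(n, 0.5) for n in sizes]          # creeps up toward 1
cut = [random_ranker_ratio(n, 0.5, c=0.2) for n in sizes]    # stays at p = 0.5
```

Under these assumptions the full-list ratio increases slowly with n, while the cut-off ratio stays at p, which matches the qualitative difference between Theorem 1 and the NDCG@k result discussed above.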

Acknowledgement 

Liwei Wang would like to thank Kai Fan and Ziteng Wang for long and helpful discussions. 

References 

[1] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for 
the area under an ROC curve. 2004. 

[2] A. Al-Maskari, M. Sanderson, and P. Clough. The relationship between IR effectiveness mea- 
sures and user satisfaction. In SIGIR, pages 773-774, 2007. 

[3] J. Aslam, E. Yilmaz, and V. Pavlu. The maximum entropy method for analyzing retrieval 
measures. In Proceedings of the 28th annual international ACM SIGIR conference on Research 
and development in information retrieval, pages 27-34. ACM, 2005. 

[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval, volume 82. Addison-Wesley,
New York, 1999.

[5] M. Balcan, N. Bansal, A. Beygelzimer, D. Coppersmith, J. Langford, and G. Sorkin. Robust 
reductions from ranking to classification. Machine learning, 72(1):139-153, 2008. 

[6] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal 
of the American Statistical Association, 101(473):138-156, 2006. 

[7] D. Buffoni, C. Calauzenes, P. Gallinari, and N. Usunier. Learning scoring functions with 
order-preserving losses and standardized supervision. In Proceedings of the 28th International 
Conference on Machine Learning, 2011. 

[8] C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In 
Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, 
volume 19, page 193. The MIT Press, 2007. 






[9] C. Calauzenes, N. Usunier, and P. Gallinari. On the (non-)existence of convex, calibrated
surrogate losses for ranking. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, 
editors, Advances in Neural Information Processing Systems 25, pages 197-205. 2012. 

[10] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. Learning to rank: from pairwise approach to 
listwise approach. In ICML, pages 129-136, 2007. 

[11] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded
relevance. In Proceedings of the 18th ACM conference on Information and knowledge management,
pages 621-630. ACM, 2009.

[12] S. J. Clemencon and N. Vayatis. Empirical performance maximization for linear rank
statistics. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural
Information Processing Systems 21, pages 305-312. 2009.

[13] S. Clemencon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. 
The Annals of Statistics, 36(2):844-874, 2008. 

[14] D. Cossock and T. Zhang. Statistical analysis of bayes optimal subset ranking. Information 
Theory, IEEE Transactions on, 54(11):5140-5154, 2008. 

[15] R. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Pro- 
cessing Systems, 2002. 

[16] W. Croft, D. Metzler, and T. Strohman. Search engines: Information retrieval in practice. 
Addison-Wesley, 2010.

[17] J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In Proceedings 
of the 27th International Conference on Machine Learning, pages 327-334, 2010. 

[18] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining 
preferences. The Journal of Machine Learning Research, 4:933-969, 2003. 

[19] J. Hajek, Z. Sidak, and P. Sen. Theory of rank tests. Academic Press, New York, 1967.

[20] R. Herbrich, T. Graepel, and R. Obermayer. Large margin rank boundaries for ordinal regres- 
sion. Advances in Neural Information Processing Systems, pages 115-132, 1999. 

[21] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant
documents. In Proceedings of the 23rd annual international ACM SIGIR conference on Research
and development in information retrieval, pages 41-48. ACM, 2000.

[22] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM
Transactions on Information Systems (TOIS), 20(4):422-446, 2002. 

[23] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth 
ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133-142.
ACM, 2002.

[24] E. Kanoulas and J. A. Aslam. Empirical justification of the gain and discount function for
NDCG. In Proceedings of the 18th ACM conference on Information and knowledge management, 
pages 611-620. ACM, 2009. 

[25] M. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81-93, 1938.

[26] R. Nallapati. Discriminative models for information retrieval. In Proceedings of the 27th annual 
international ACM SIGIR conference on Research and development in information retrieval, 
pages 64-71. ACM, 2004. 




[27] P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In 
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, AISTATS,
2011. 

[28] C. Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top 
of the list. The Journal of Machine Learning Research, 10:2233-2271, 2009. 

[29] T. Sakai. Evaluating evaluation metrics based on the bootstrap. In Proceedings of the 29th 
annual international ACM SIGIR conference on Research and development in information 
retrieval, pages 525-532. ACM, 2006. 

[30] G. Sansone. Orthogonal Functions. Interscience Publishers Inc., New York, 1959. 

[31] A. Tewari and P. Bartlett. On the consistency of multiclass classification methods. Journal of
Machine Learning Research, 8:1007-1025, 2007. 

[32] A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. 
In Proceedings of the 29th annual international ACM SIGIR conference on Research and de- 
velopment in information retrieval, pages 11-18. ACM, 2006. 

[33] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing NDCG measure. 
Advances in Neural Information Processing Systems, 22:1883-1891, 2009. 

[34] E. Voorhees. Evaluation by highly relevant documents. In Proceedings of the 24th annual 
international ACM SIGIR conference on Research and development in information retrieval, 
pages 74-82. ACM, 2001. 

[35] F. Xia, T. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: theory 
and algorithm. In Proceedings of the 25th international conference on Machine learning, pages 
1192-1199. ACM, 2008. 

[36] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing 
average precision. In Proceedings of the 30th annual international ACM SIGIR conference on 
Research and development in information retrieval, pages 271-278. ACM, 2007. 

[37] T. Zhang. Statistical analysis of some multi-category large margin classification methods. The 
Journal of Machine Learning Research, 5:1225-1251, 2004. 

[38] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk 
minimization. Annals of Statistics, pages 56-85, 2004. 



A Proof of Theorem 2: the Key Lemmas

In this section we will prove Theorem 2. In fact we will prove a more complete result. The proof
relies on a few key lemmas. In this section we only state these lemmas; their proofs will be given in
Appendix B. First we give a weaker definition of distinguishability, which guarantees that the ranking
measure $\mathcal{M}$ gives consistent comparison results for two ranking functions only in expectation.

Definition 4. Fix an underlying distribution $P_{XY}$. A pair of ranking functions $f_0, f_1$ is said to
be distinguishable in expectation by a ranking measure $\mathcal{M}$ if there exist $b \in \{0, 1\}$ and a positive
integer $N$ such that for all $n \ge N$,

$$\mathbb{E}[\mathcal{M}(f_b, S_n)] > \mathbb{E}[\mathcal{M}(f_{1-b}, S_n)],$$

where the expectation is over the random draw of $S_n$.

Now we state a theorem which contains Theorem 2.



Theorem 14. Assume that $p = \Pr(Y = 1) > 0$. For every pair of ranking functions $f_0, f_1$, let
$y^{f_i}(s) = \Pr[Y = 1 \mid f_i(X) = s]$, $i = 0, 1$. Unless $y^{f_0}(s) = y^{f_1}(s)$ almost everywhere on $[0,1]$, $f_0, f_1$ are
distinguishable in expectation by standard NDCG, whose discount is $D(r) = \frac{1}{\log(1+r)}$.

Moreover, if $y^{f_0}(s)$ and $y^{f_1}(s)$ are Hölder continuous in $s$, then unless $y^{f_0}(s) = y^{f_1}(s)$ almost
everywhere on $[0, 1]$, $f_0$ and $f_1$ are consistently distinguishable by standard NDCG.

To prove Theorem 14 we need some notation.

Definition 5. Suppose $\mathcal{Y} = \{0,1\}$. Let $y^f(s) = \Pr[Y = 1 \mid f(X) = s]$. Also let $F(t) = \int_1^t D(s)\,ds$.
We define the unnormalized pseudo-expectation $\tilde N_D^f(n)$ as

$$\tilde N_D^f(n) = \int_1^n y^f(1 - s/n)\,D(s)\,ds = n\int_{1/n}^1 y^f(1 - s)\,D(ns)\,ds.$$

Assume that $p = \Pr(Y = 1) > 0$. Define the normalized pseudo-expectation $N_D^f(n)$ as

$$N_D^f(n) = \frac{\tilde N_D^f(n)}{F(np)}.$$
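To make the definition concrete, the normalized pseudo-expectation can be approximated numerically. The sketch below assumes the form $N_D^f(n) = \int_1^n y^f(1 - s/n)D(s)\,ds \,/\, F(np)$ with $F(t) = \int_1^t D(s)\,ds$ and uses a plain midpoint rule; `integrate` and `pseudo_expectation` are hypothetical helper names, and `cond_rel` stands for an assumed conditional relevance function $y^f$.

```python
import math


def log_discount(s):
    return 1.0 / math.log(1.0 + s)


def integrate(g, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))


def pseudo_expectation(cond_rel, n, p, discount=log_discount):
    """Approximate N_D^f(n) = integral_1^n y^f(1 - s/n) D(s) ds / F(np)."""
    unnormalized = integrate(lambda s: cond_rel(1.0 - s / n) * discount(s), 1.0, n)
    normalizer = integrate(discount, 1.0, n * p)   # F(np)
    return unnormalized / normalizer


p = 0.5
# A ranker that places every relevant item above every irrelevant one.
perfect = pseudo_expectation(lambda s: 1.0 if s >= 1.0 - p else 0.0, n=1000, p=p)
# A ranker whose score carries no information about the label.
uninformed = pseudo_expectation(lambda s: p, n=1000, p=p)
```

In this example the perfect ranker has pseudo-expectation close to 1, while the uninformative one stays below it, which is the gap that the lemmas below quantify.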



The proof of the first part of Theorem 14 (i.e., distinguishable in expectation) relies on the
following two key lemmas, whose proofs will be given in Appendix B.



Lemma 2. Let $D(r) = \frac{1}{\log(1+r)}$. Assume that $p = \Pr(Y = 1) > 0$. Then for every ranking function $f$,

$$\left|\mathbb{E}[\mathrm{NDCG}_D(f, S_n)] - N_D^f(n)\right| \le O\left(n^{-1/3}\right).$$



Lemma 3. Let $D(r) = \frac{1}{\log(1+r)}$. Assume that $p = \Pr(Y = 1) > 0$. Let $y^{f_i}(s) = \Pr[Y = 1 \mid f_i(X) = s]$,
$i = 0, 1$. Unless $y^{f_0}(\cdot) = y^{f_1}(\cdot)$ almost everywhere on $[0, 1]$, there must exist a nonnegative integer
$K$ and a constant $a \neq 0$, such that

$$\left|N_D^{f_0}(n) - N_D^{f_1}(n) - \frac{a}{\log^K n}\right| \le O\left(\frac{1}{\log^{K+1} n}\right).$$



Lemma 2 says that the difference between the expectation of the NDCG measure of a ranking
function and its pseudo-expectation is relatively small, while Lemma 3 says that the difference
between the pseudo-expectations of two essentially different ranking functions is much larger.

To prove the "moreover" part of Theorem 14 (i.e., consistently distinguishable), we need the
following key lemma, whose proof will be given in Appendix B. The lemma states that with high
probability the NDCG measure of a ranking function is very close to its pseudo-expectation.



Lemma 4. Let $D(r) = \frac{1}{\log(1+r)}$. Assume that $p = \Pr(Y = 1) > 0$. Suppose the ranking function $f$
satisfies that $y^f(s) = \Pr(Y = 1 \mid f(X) = s)$ is Hölder continuous with constants $\alpha > 0$ and $C > 0$.
That is, $|y^f(s) - y^f(s')| \le C|s - s'|^{\alpha}$ for all $s, s' \in [0,1]$. Then

$$\Pr\left[\left|\mathrm{NDCG}_D(f, S_n) - N_D^f(n)\right| > 5Cp^{-1} n^{-\min(\alpha/3,\,1)}\right] \le o\left(e^{-n^{1/4}}\right).$$



Proof of Theorem 14. That $f_0$ and $f_1$ are strictly distinguishable in expectation by standard
NDCG is straightforward from Lemma 2 and Lemma 3. That $f_0$ and $f_1$ are strictly distinguishable
with high probability follows immediately from Lemma 4, Lemma 3 and the observation that

$$\sum_{n \ge N} e^{-n^{1/4}} \le O\left(N^{3/4} e^{-N^{1/4}}\right) \le O\left(e^{-N^{1/5}}\right).$$

□

B Proofs of the Key Lemmas in Appendix A

In this section, we give proofs of the three key lemmas in Appendix A (i.e., Lemma 2, Lemma 3
and Lemma 4) used to prove Theorem 2 and Theorem 14.

To prove the key lemmas, we need a few technical claims, whose proofs will be given in Appendix
C. We first give four claims that will be used in the proof of Lemma 2.

Claim 1. For any $s \in [0, 1]$,

$$\sum_{r=1}^{n} \mathbb{P}[f(x_{(r)}) = s] = n. \qquad (8)$$

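Claim 2. Recall that the DCG ranking measure with respect to discount $D(\cdot)$ was defined as

$$\mathrm{DCG}_D(f, S_n) = \sum_{r=1}^{n} y^f_{(r)} D(r). \qquad (9)$$

Let $D(r) = \frac{1}{\log(1+r)}$, and $y^f(s) = \Pr[Y = 1 \mid f(X) = s]$. Then

$$\mathbb{E}[\mathrm{DCG}_D(f, S_n)] = \sum_{r=1}^{n} \frac{1}{\log(1+r)} \int_0^1 \mathbb{P}[f(x_{(r)}) = 1 - s]\; y^f(1 - s)\,ds. \qquad (10)$$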



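Claim 3. For any positive integer $n$, define $E_{n,r} = \left[\frac{r}{n} - n^{-1/3},\ \frac{r}{n} + n^{-1/3}\right]$ ($r \in [n]$). Then for any
$r \in [n]$,

$$\Pr\left[1 - f(x_{(r)}) \in E_{n,r}\right] \ge 1 - 2e^{-2n^{1/3}}. \qquad (11)$$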
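The concentration of order statistics behind this claim is easy to observe empirically. The following sketch is only an illustration, assuming scores $f(x_i)$ drawn uniformly from $[0,1]$: it measures how far the $r$-th largest score falls from $1 - r/n$ over all ranks. The function name is hypothetical.

```python
import random


def max_order_statistic_deviation(n, seed=0):
    """Largest |x_(r) - (1 - r/n)| over r, where x_(r) is the r-th largest of n uniforms."""
    rng = random.Random(seed)
    scores = sorted((rng.random() for _ in range(n)), reverse=True)
    return max(abs(x - (1.0 - r / n)) for r, x in enumerate(scores, start=1))


n = 10_000
deviation = max_order_statistic_deviation(n)
within_window = deviation < n ** (-1.0 / 3.0)   # compare with the window half-width n^(-1/3)
```

For n of this size the largest deviation is typically an order of magnitude smaller than $n^{-1/3}$, so every $1 - f(x_{(r)})$ lands inside its window $E_{n,r}$.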



Claim 4. Let $\mathcal{Y} = \{0,1\}$. Assume $D(r) = \frac{1}{\log(1+r)}$. Let $F(t) = \int_1^t D(s)\,ds$. Assume also $p =
\Pr[Y = 1] > 0$. Then for every sufficiently large $n$, with probability $(1 - 2e^{-2n^{1/3}})$ the following
inequality holds:

$$\left|\mathrm{NDCG}_D(f, S_n) - \frac{\mathrm{DCG}_D(f, S_n)}{F(np)}\right| \le O\left(n^{-1/3}\right). \qquad (12)$$

Now we are ready to prove Lemma 2.



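Proof of Lemma 2. By the definition of the unnormalized pseudo-expectation $\tilde N_D^f(n)$ (see Definition 5) and eq. (8), we have

$$\tilde N_D^f(n) = n\int_{1/n}^1 \frac{y^f(1-s)}{\log(1+ns)}\,ds = \sum_{r=1}^{n}\int_{1/n}^1 \frac{y^f(1-s)\,\mathbb{P}[f(x_{(r)}) = 1-s]}{\log(1+ns)}\,ds.$$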



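By eq. (10) in Claim 2 and the preceding identity, and noting that $y^f(s) \le 1$, we obtain

$$\left|\mathbb{E}[\mathrm{DCG}_D(f, S_n)] - \tilde N_D^f(n)\right| \le \sum_{r=1}^{n}\int_{1/n}^1 y^f(1-s)\,\mathbb{P}[f(x_{(r)}) = 1-s]\left|\frac{1}{\log(1+r)} - \frac{1}{\log(1+ns)}\right| ds + \frac{1}{n}\sum_{r=1}^{n}\frac{1}{\log(1+r)} \qquad (13)$$

$$\le \sum_{r=1}^{n}\int_{[\frac{1}{n},1]\setminus E_{n,r}} \mathbb{P}[f(x_{(r)}) = 1-s]\left|\frac{1}{\log(1+r)} - \frac{1}{\log(1+ns)}\right| ds + \sum_{r=1}^{n}\int_{E_{n,r}\cap[\frac{1}{n},1]} \mathbb{P}[f(x_{(r)}) = 1-s]\left|\frac{1}{\log(1+r)} - \frac{1}{\log(1+ns)}\right| ds + O\left(\frac{1}{\log n}\right). \qquad (14)$$

We next bound the two terms on the RHS of the last inequality of (13) separately. By Claim 3,
the first term can be upper bounded by

$$2e^{-2n^{1/3}}\sum_{r=1}^{n}\ \sup_{s \in [\frac{1}{n},1]\setminus E_{n,r}}\left|\frac{1}{\log(1+r)} - \frac{1}{\log(1+ns)}\right| \le \frac{2n\,e^{-2n^{1/3}}}{\log 2}. \qquad (15)$$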



For the second term on the RHS of the last inequality of (14), it is easy to check that the following
two inequalities hold:



Vr > n 2/3 , sup 

ses„ >r n[i,i] 



1 



1 



log(l + r) log(l + ns) 
Vr < n 2/3 



,2/3 



< 



(l + r)log^(l + r) \(l+r)log z (l + r) 



7 2 /3 



sup 

seE n r n[i.il 



1 



log(l + r) log(l + ns) 
Combining (13), (15), (16) and (17), we obtain



< 



E[DCG D (/,S„)] -iV^(n) 



2ne- 2ne / // 

< : : + 



2/3 



log 2 



S..7I 



log 2 



,2/3 



(16) 

(17) 



lo S 2 J^/» (1 + r) log '(1 + r) 



< O (n 2 / 3 ) 



(18) 
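Finally, observe that $F(np) = \mathrm{Li}(1 + np)$, where $\mathrm{Li}$ is the offset logarithmic integral function. By
Claim 4 and the well-known fact $\mathrm{Li}(n) \sim \frac{n}{\log n}$, we have the following inequality, and this completes
the proof.

$$\left|\mathbb{E}[\mathrm{NDCG}_D(f, S_n)] - \frac{\mathbb{E}[\mathrm{DCG}_D(f, S_n)]}{\mathrm{Li}(1 + np)}\right| \le O\left(n^{-1/3}\right) + O\left(e^{-2n^{1/3}}\right)$$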
We next turn to the proof of Lemma 3. We need the following three claims.

Claim 5. For sufficiently large $n$,



log fe xAx = O 



log fc i 



(19) 
□ 



(20) 



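Claim 6. Fix an integer $k \in \mathbb{N}^* = \{0\} \cup \mathbb{N}$. For sufficiently large $n$,

$$\int_{2/n}^1 \left|\frac{\log^k x}{(\log(nx))^{k+1}}\right| dx \le O\left(\frac{1}{\log^{k+1} n}\right). \qquad (21)$$

Claim 7. $\mathrm{span}\left(\{\log^k x\}_{k \ge 0}\right)$ is dense in $L^2[0,1]$.

Now we are ready to prove Lemma 3.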



Proof of Lemma 3. Let $\Delta y(s) = y^{f_0}(s) - y^{f_1}(s)$. By the definition of the normalized
pseudo-expectation (see Definition 5) and the fact that $|\Delta y(s)| \le 1$, we have



,-/o 



Ng(n)-Ng(n) 



1 Ay(l - s)ds 
Li(l + np) J ' j, log(l + ns) 



1 Ay(l - S )ds , Q ^ 1 



Li(l + np) J 2. log(l + ns) \^K r 



(22)

Expanding $\frac{1}{\log(1+ns)}$ at the point $ns$, we obtain

nl Ay(l - s)ds f 1 Ay(l - s)ds 



log(l + ns) 



log n + log s 



< 



As 



ns log (ns) 



<0 



logn 



Expanding $\frac{1}{\log n + \log s}$ at the point $\log n$, we have that for all $m \in \mathbb{N}^*$, the following holds:



1 Ay(l-s)ds ^f-lV'- 1 f 1 A . ., ,_! , 

t V -E , / Ay(l- S )log J l sds 

| log n + logs ^ log- 7 n J 2. 



1 Ay(l - s) log m sds 



(logn + ^s)"^ 1 



< 



Ay(l - s) log™ s 



d.s 



(logn + logs) m+1 



<o 



log n 



(23) 



(24) 



Note in the above derivation that $\xi_{n,s} \in (\log s, 0)$, and the last inequality is due to Claim 6.

Furthermore, by Claim 7, unless $\Delta y(s) = 0$ a.e., there exist constants $k \in \mathbb{N}^*$ and $a \neq 0$ such
that

$$(-1)^k \int_0^1 \Delta y(1 - s)\,\log^k s\,ds = a. \qquad (25)$$

Let $K$ be the smallest integer $k$ for which Eq. (25) holds. Combining (22), (23), (24) and (25), and
noting Claim 5, we have the following, and this completes the proof.



r/o 



Ng(n)-Ng(n) 



log n 



<o^ + o 



log* +1 n 



D 



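To prove the last key lemma, we need the following claim.

Claim 8. Let $D(r) = \frac{1}{\log(1+r)}$. Let $F(t) = \int_1^t D(r)\,dr$. Assume $y^f(s)$ is Hölder continuous with
constants $\alpha$ and $C$. Then

$$\left|\sum_{r=1}^{n} y^f(1 - r/n)\,D(r) - \tilde N_D^f(n)\right| \le C n^{-\alpha/3} F(n) + D(1) + |D'(1)|, \qquad (26)$$

where $\tilde N_D^f(n)$ denotes the unnormalized pseudo-expectation.

Now we prove the last key lemma.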

Proof of Lemma 4. Let $x_1, \ldots, x_n$ be instances drawn i.i.d. according to $P_X$. Let $\tilde x_{(r)} = f(x_{(r)})$,
so that by definition $\tilde x_{(1)} \ge \tilde x_{(2)} \ge \cdots \ge \tilde x_{(n)}$. By Chernoff bound, for every $r$, with probability at most $2e^{-2n^{1/3}}$
we have $|\tilde x_{(r)} - (1 - r/n)| > n^{-1/3}$. A union bound over $r$ then yields

$$\Pr\left[\forall r \in [n],\ \left|\tilde x_{(r)} - \left(1 - \frac{r}{n}\right)\right| \le n^{-1/3}\right] \ge 1 - 2n e^{-2n^{1/3}}. \qquad (27)$$

Since $y^f$ is Hölder continuous with constants $\alpha$ and $C$, eq. (27) implies

$$\Pr\left[\left|\sum_{r=1}^{n} y^f(\tilde x_{(r)})D(r) - \sum_{r=1}^{n} y^f(1 - r/n)D(r)\right| \le C n^{-\alpha/3}\sum_{r=1}^{n} D(r)\right] \ge 1 - 2n e^{-2n^{1/3}}. \qquad (28)$$

Combining Claim 8 and eq. (28), and noting that $|D'(1)| + D(1) < 10$, we have

$$\Pr\left[\left|\sum_{r=1}^{n} y^f(\tilde x_{(r)})D(r) - \tilde N_D^f(n)\right| \le 2C n^{-\alpha/3} F(n) + 10\right] \ge 1 - 2n e^{-2n^{1/3}}. \qquad (29)$$

Fix $x_1, \ldots, x_n$. Let $x_{(1)}, \ldots, x_{(n)}$ be the induced ordered sequence. Also let $\tilde x_{(r)} = f(x_{(r)})$.
Recall that $y^f(s) = \mathbb{E}[Y \mid f(X) = s]$. Thus $\sum_{r=1}^{n} y^f(\tilde x_{(r)})D(r)$ is the expectation of $\mathrm{DCG}_D(f, S_n) =
\sum_{r=1}^{n} y_{(r)} D(r)$ conditioned on the fixed values $\tilde x_{(1)}, \ldots, \tilde x_{(n)}$. Also observe that, conditioning on
$\tilde x_{(1)}, \ldots, \tilde x_{(n)}$, the labels $y_{(r)}$ ($r = 1, \ldots, n$) are independent. By Hoeffding's inequality, and taking into
consideration that $x_1, \ldots, x_n$ are arbitrary and $(D(r))^2 \le D(r)$ for all $r$, we have for every $\epsilon > 0$

$$\Pr\left[\left|\mathrm{DCG}_D(f, S_n) - \sum_{r=1}^{n} y^f(\tilde x_{(r)})D(r)\right| > \epsilon\right] \le 2\exp\left(-\frac{2\epsilon^2}{F(n)}\right). \qquad (30)$$

Setting $\epsilon = F(n)^{2/3}$ in eq. (30) and combining with eq. (29), we have

$$\Pr\left[\left|\mathrm{DCG}_D(f, S_n) - \tilde N_D^f(n)\right| > 2C n^{-\alpha/3}F(n) + 2F(n)^{2/3}\right] \le 2n e^{-2n^{1/3}} + 2e^{-2F(n)^{1/3}}. \qquad (31)$$

Simple calculations yield

$$\Pr\left[\left|\frac{\mathrm{DCG}_D(f, S_n)}{F(np)} - N_D^f(n)\right| > 4Cp^{-1} n^{-\min(\alpha/3,\,1)}\right] \le 2n e^{-2n^{1/3}} + 2e^{-2F(n)^{1/3}}. \qquad (32)$$

Combining eq. (12) and (32), the lemma follows. □



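C Proof of the Technical Claims in Appendix B

Here we give proofs of the technical claims by which we prove the three key lemmas in Appendix B.

Proof of Claim 1. Recall that for each $i \in [n]$, $f(x_i)$ is uniformly distributed on $[0, 1]$; and $x_{(1)}, \ldots, x_{(n)}$ are just a
reordering of $x_1, \ldots, x_n$. Thus

$$\sum_{r=1}^{n} \mathbb{P}[f(x_{(r)}) = s] = \sum_{i=1}^{n} \mathbb{P}[f(x_i) = s] = n.$$

□

Proof of Claim 2. We have

$$\mathbb{E}[\mathrm{DCG}_D(f, S_n)] = \sum_{r=1}^{n} D(r)\,\mathbb{E}\left[y_{(r)}\right] = \sum_{r=1}^{n} \frac{1}{\log(1+r)}\,\mathbb{E}\left[\mathbb{E}\left[y_{(r)} \mid f(x_{(r)})\right]\right] = \sum_{r=1}^{n} \frac{1}{\log(1+r)} \int_0^1 \mathbb{P}[f(x_{(r)}) = s]\; y^f(s)\,ds. \qquad (33)$$

□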



Proof of Claim 3. Just observe that $f(x_{(r)})$ is the $r$-th order statistic ($r$-th largest) of $n$ uniformly
distributed random variables on $[0,1]$. Chernoff bound yields the result. □

Proof of Claim 4. Let $l = \sum_{(x,y) \in S_n} I[y = 1]$ be the number of $y = 1$ in $S_n$. Since $S_n$ is sampled i.i.d. and
$\Pr[Y = 1] = p$, by Chernoff bound we have

$$\Pr\left[\left|l/n - p\right| > n^{-1/3}\right] \le 2e^{-2n^{1/3}}. \qquad (34)$$

Thus with probability at least $1 - 2e^{-2n^{1/3}}$,

$$\left|\mathrm{NDCG}_D(f, S_n) - \frac{\mathrm{DCG}_D(f, S_n)}{F(np)}\right| = \left|\frac{\mathrm{DCG}_D(f, S_n)}{F(l)} - \frac{\mathrm{DCG}_D(f, S_n)}{F(np)}\right| \le \mathrm{DCG}_D(f, S_n)\cdot\max\left\{\left|\frac{1}{F(n(p - n^{-1/3}))} - \frac{1}{F(np)}\right|,\ \left|\frac{1}{F(n(p + n^{-1/3}))} - \frac{1}{F(np)}\right|\right\}.$$

Recall that $F(t) = \int_1^t \frac{1}{\log(1+r)}\,dr$ and $p > 0$; and observe that $\mathrm{DCG}_D(f, S_n) \le F(n)$. Taylor expansion
of $\frac{1}{F((p \pm n^{-1/3})n)}$ at $np$ and some simple calculations yield the result. □



Proof of Claim 5. Integrating by parts, we have

$$\int \log^k x\,dx = k!\sum_{j=0}^{k} (-1)^{k-j}\,\frac{x\log^j x}{j!} + c. \qquad (35)$$

The claim follows. □
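The antiderivative obtained by repeated integration by parts, $\int \log^k x\,dx = k!\sum_{j=0}^{k}(-1)^{k-j}\frac{x\log^j x}{j!} + c$, can be checked numerically. The sketch below compares it with a midpoint-rule integral on an arbitrarily chosen interval; the helper names are hypothetical.

```python
import math


def log_power_antiderivative(x, k):
    """k! * sum_{j=0}^{k} (-1)**(k-j) * x * log(x)**j / j!  (integration constant omitted)."""
    return math.factorial(k) * sum(
        (-1) ** (k - j) * x * math.log(x) ** j / math.factorial(j) for j in range(k + 1)
    )


def midpoint_integral(g, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))


k, lo, hi = 3, 0.5, 2.0
closed_form = log_power_antiderivative(hi, k) - log_power_antiderivative(lo, k)
numeric = midpoint_integral(lambda x: math.log(x) ** k, lo, hi)
```

The two values agree to many digits, which supports the closed form used to bound the integral of $\log^k x$ near zero.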



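Proof of Claim 6. Changing variable by letting $x = n^{-t}$, we have

$$\int_{2/n}^1 \left|\frac{\log^k x}{(\log(nx))^{k+1}}\right| dx = \int_0^{1 - \frac{\log 2}{\log n}} \frac{t^k}{(1-t)^{k+1}}\,e^{-t\log n}\,dt = \int_0^{1/2} \frac{t^k}{(1-t)^{k+1}}\,e^{-t\log n}\,dt + \int_{1/2}^{1 - \frac{\log 2}{\log n}} \frac{t^k}{(1-t)^{k+1}}\,e^{-t\log n}\,dt. \qquad (36)$$

Now we upper bound the two terms in the last line of eq. (36) separately. For the first term we
have

$$\int_0^{1/2} \frac{t^k}{(1-t)^{k+1}}\,e^{-t\log n}\,dt \le 2^{k+1}\int_0^{1/2} t^k e^{-t\log n}\,dt = \frac{2^{k+1}}{(\log n)^{k+1}}\int_0^{\frac{\log n}{2}} \tau^k e^{-\tau}\,d\tau \le \frac{2^{k+1}\,\Gamma(k+1)}{(\log n)^{k+1}} = O\left(\frac{1}{(\log n)^{k+1}}\right), \qquad (37)$$

where $\Gamma$ is the gamma function, and the last equality holds because $k$ is a fixed integer.
For the second term we have

$$\int_{1/2}^{1 - \frac{\log 2}{\log n}} \frac{t^k}{(1-t)^{k+1}}\,e^{-t\log n}\,dt \le \left(\frac{\log n}{\log 2}\right)^{k+1}\int_{1/2}^{1} e^{-t\log n}\,dt \le \frac{1}{2}\cdot\frac{1}{\sqrt{n}}\left(\frac{\log n}{\log 2}\right)^{k+1} = \tilde O\left(\frac{1}{\sqrt{n}}\right), \qquad (38)$$

where in $\tilde O$ we hide the polylog($n$) terms.

Combining (37) and (38) we complete the proof. □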



Proof of Claim 7. We only need to show that for any $f \in L^2[0, 1]$, if

$$\int_0^1 f(x)\log^k x\,dx = 0, \quad k = 0, 1, \ldots, \qquad (39)$$

then $f = 0$ a.e. on $[0, 1]$.

Let $t = -\log x$; then eq. (39) becomes

$$\int_0^{\infty} f(e^{-t})\,t^k e^{-t}\,dt = 0, \quad k = 0, 1, \ldots$$

Note that Laguerre polynomials form a complete basis of $L^2[0, \infty)$ (cf. [30], p. 349); thus $\{t^k\}_{k \ge 0}$
is complete in $L^2[0, \infty)$ with respect to the measure $e^{-t}$. The claim follows. □

Proof of Claim 8. Writing $\tilde N_D^f(n)$ for the unnormalized pseudo-expectation, we have

$$\left|\sum_{r=1}^{n} y^f(1 - r/n)D(r) - \tilde N_D^f(n)\right|$$
$$= \left|\sum_{r=1}^{n} y^f(1 - r/n)D(r) - \int_1^n y^f(1 - s/n)D(s)\,ds\right|$$
$$= \left|\sum_{r=1}^{n-1}\int_r^{r+1}\left(y^f(1 - r/n)D(r) - y^f(1 - s/n)D(s)\right)ds + y^f(0)D(n)\right|$$
$$\le \sum_{r=1}^{n-1}\int_r^{r+1} y^f(1 - s/n)\left|D(r) - D(s)\right|ds + \sum_{r=1}^{n-1}\int_r^{r+1}\left|y^f(1 - r/n) - y^f(1 - s/n)\right|D(r)\,ds + y^f(0)D(n)$$
$$\le \sum_{r=1}^{n-1}\int_r^{r+1}\left|D(r) - D(s)\right|ds + C n^{-\alpha/3}\sum_{r=1}^{n-1} D(r) + D(n)$$
$$\le \sum_{r=1}^{n-1}\left|D'(r)\right| + C n^{-\alpha/3}F(n) + D(n)$$
$$\le C n^{-\alpha/3}F(n) + |D'(1)| + \int_1^n |D'(r)|\,dr + D(n)$$
$$\le C n^{-\alpha/3}F(n) + |D'(1)| + D(1) - D(n) + D(n)$$
$$= C n^{-\alpha/3}F(n) + |D'(1)| + D(1).$$

Note that the sixth and the seventh lines both hold because $|D'(r)|$ is monotone decreasing, and the second
line from the bottom holds because $D(r)$ is monotone decreasing. □

D Proof of the Convergence Theorems 

In this section we give the proofs of the theorems concerning the convergence of NDCG with various
discounts and cut-offs.

First we give the proof of Theorem 1, i.e., the standard NDCG converges to 1 almost surely for
every ranking function.

Proof of Theorem 1. For notational simplicity we only prove the case $\mathcal{Y} = \{0, 1\}$. Generalization
is straightforward. Recall that $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ consists of $n$ i.i.d. instance-label
pairs drawn from an underlying distribution $P_{XY}$. Let $p = \Pr(Y = 1)$. Also let $l = \sum_{i=1}^{n} y_i$. If
$p = 0$, the theorem trivially holds. Suppose $p > 0$; by Chernoff bound we have

$$\Pr\left(\left|\frac{l}{n} - p\right| > n^{-1/3}\right) \le 2e^{-2n^{1/3}}.$$

For fixed $n$, conditioned on the event that $\left|\frac{l}{n} - p\right| \le n^{-1/3}$, by the definition of NDCG, it is easy
to see that

$$\mathrm{NDCG}_D(f, S_n) = \frac{\sum_{r=1}^{n} y^f_{(r)}\frac{1}{\log(1+r)}}{\sum_{r=1}^{l}\frac{1}{\log(1+r)}} \ge \frac{\sum_{r=n-l+1}^{n}\frac{1}{\log(1+r)}}{\sum_{r=1}^{l}\frac{1}{\log(1+r)}} \ge \frac{\mathrm{Li}(n+1) - \mathrm{Li}(n(1 - p + n^{-1/3}) + 1)}{\mathrm{Li}(n(p + n^{-1/3}) + 1)} - o(1) \ge 1 - o(1), \qquad (40)$$

where $\mathrm{Li}(t) = \int_2^t \frac{d\tau}{\log \tau}$ is the offset logarithmic integral function; and the last step in eq. (40) is due
to the well-known fact that $\mathrm{Li}(t) \sim \frac{t}{\log t}$. Thus for any $\epsilon > 0$, and for any sufficiently large $n$,
conditioned on the event that $\left|\frac{1}{n}\sum_{i=1}^{n} y_i - p\right| \le n^{-1/3}$, we have

$$\mathrm{NDCG}_D(f, S_n) \ge 1 - \epsilon.$$

Also recall that $\mathrm{NDCG}_D(f, S_n) \le 1$. We have, for any $\epsilon > 0$ and every sufficiently large $n$,

$$\Pr\left(\left|\mathrm{NDCG}_D(f, S_n) - 1\right| > \epsilon\right) \le 2e^{-2n^{1/3}}.$$

Since $\sum_{n \ge 1} 2e^{-2n^{1/3}} < \infty$, by the Borel-Cantelli lemma $\mathrm{NDCG}_D(f, S_n)$ converges to 1 almost surely.

□

Next we give details of the other feasible discount functions as well as the cut-off versions. In
particular, we provide proofs of Theorems 3, 5, 7, 8 and 9. The proofs of these five theorems are quite
similar. We only prove Theorem 3 to illustrate the ideas. The proofs of the other four theorems
require only minor modifications.

The proof of Theorem 3 relies on the following lemma, which is similar to Lemma 4.






Lemma 5. Let $D(r) = r^{-\beta}$ for some $\beta \in (0, 1)$. Assume that $p = \Pr(Y = 1) > 0$. If the ranking
function $f$ satisfies that $y^f(s) = \Pr(Y = 1 \mid f(X) = s)$ is continuous, then for every $\epsilon > 0$ the
following inequality holds for all sufficiently large $n$:

$$\Pr\left[\left|\mathrm{NDCG}_D(f, S_n) - N_D^f(n)\right| > 5p^{-1}\epsilon\right] \le o(1).$$

Proof of Theorem 3. The theorem follows from Lemma 5 and simple calculations of $\lim_{n \to \infty} N_D^f(n)$.
We omit the details. □

Proof of Lemma 5. The proof is a simple modification of the proof of Lemma 4. Note that the
difference between Lemma 5 and Lemma 4 is that here we do not assume $y^f$ is Hölder continuous. We
only assume it is continuous.

Next observe that Claim 8 holds for $D(r) = r^{-\beta}$ ($0 < \beta < 1$) as well, because in the proof
of Claim 8 we only use two properties of $D(r)$: $D(r)$ is monotone decreasing and $|D'(r)|$
is monotone decreasing. Clearly $D(r) = r^{-\beta}$ satisfies these properties. But here $y^f(s)$ is merely
continuous rather than Hölder continuous. Thus we have a modified version of Claim 8. That is,
for every $\epsilon > 0$, the following holds for all sufficiently large $n$:

$$\left|\sum_{r=1}^{n} y^f(1 - r/n)D(r) - \tilde N_D^f(n)\right| \le \epsilon F(n) + D(1) + |D'(1)|,$$

where $\tilde N_D^f(n)$ is the unnormalized pseudo-expectation.

The rest of the proof is almost identical to that of Lemma 4. We omit the details. □

Finally, we give the proof of Theorem 6, i.e., if the discount decays substantially faster than
$r^{-1}$, then the NDCG measure does not converge. Moreover, no pair of ranking functions is
strictly distinguishable with high probability by the measure.

Proof of Theorem 6. For notational simplicity we give a proof for $|\mathcal{Y}| = 2$ and $\mathcal{Y} = \{0, 1\}$. It is
straightforward to generalize it to other cases.

In fact, we only need to show that for every ranking function $f$, there are constants $a, b, c > 0$
with $a > b$, such that for all sufficiently large $n$,

$$\Pr[\mathrm{NDCG}_D(f, S_n) > a] > c$$

and

$$\Pr[\mathrm{NDCG}_D(f, S_n) < b] > c$$

both hold. Once we prove this, by definition the ranking measure does not converge (in probability).
Also, it is clear that for every pair of ranking functions, there is at least a constant probability that
the ranking measures of the two functions "overlap". Therefore distinguishability is not possible.

For sufficiently large $n$, fix any $x_1, \ldots, x_n$. According to the assumption, the probability that
the top-ranked $m$ data all have label 1 is at least $(\delta/2)^m$, where $m$ is the minimal integer such that

$$\sum_{r=1}^{m} D(r) > \frac{2}{3}\sum_{r=1}^{\infty} D(r).$$

Clearly we have

$$\Pr\left[\mathrm{NDCG}_D(f, S_n) > \frac{2}{3}\ \Big|\ x_1, \ldots, x_n\right] \ge (\delta/2)^m.$$

On the other hand, the probability that the top-ranked $m$ elements all have label 0 and there
are at least $m$ elements in the list that have label 1 is at least $(\delta/2)^{2m}$. Note that

$$\frac{\sum_{r=m+1}^{n} D(r)}{\sum_{r=1}^{m} D(r)} \le \frac{1}{2}.$$

Thus we have

$$\Pr\left[\mathrm{NDCG}_D(f, S_n) < \frac{1}{2}\ \Big|\ x_1, \ldots, x_n\right] \ge (\delta/2)^{2m}.$$

Since $x_1, \ldots, x_n$ are arbitrary, the theorem follows.

□
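The mechanism in this proof, namely that a few top positions dominate the measure when the discount decays too fast, can be seen in a small example with $D(r) = 2^{-r}$. The two label lists below are constructed for illustration only; both contain about half relevant items and differ in their first three positions.

```python
def ndcg_exp_discount(ranked_labels):
    """NDCG of binary labels in ranked order under the discount D(r) = 2**(-r)."""
    dcg = sum(y * 2.0 ** -r for r, y in enumerate(ranked_labels, start=1))
    relevant = sum(ranked_labels)
    ideal = sum(2.0 ** -r for r in range(1, relevant + 1))
    return dcg / ideal if ideal > 0 else 0.0


# 1000 items, 500 of them relevant in both lists.
head_hits = [1, 1, 1] + [0] * 500 + [1] * 497      # relevant items at ranks 1-3
head_misses = [0, 0, 0] + [1] * 500 + [0] * 497    # irrelevant items at ranks 1-3

high = ndcg_exp_discount(head_hits)     # about 0.875
low = ndcg_exp_discount(head_misses)    # about 0.125
```

Under a random ranking, each of these head patterns occurs with a probability that does not shrink as n grows, so the measure keeps fluctuating between high and low values instead of converging.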



E Proof of Distinguishability for NDCG with $r^{-\beta}$ ($\beta \in (0, 1)$) Discount

Here we give the proof of Theorem 4, i.e., NDCG with $r^{-\beta}$ ($0 < \beta < 1$) discount has the power of
distinguishability.

Proof of Theorem 4. The proof of distinguishability for the polynomial discount is much easier than
that for the logarithmic discount, because in the former case the pseudo-expectation has a very simple
form. If $f_0$ and $f_1$ satisfy the first condition $\int_0^1 \Delta y(s)(1 - s)^{-\beta}ds \neq 0$, then the theorem is trivially
true since $\mathrm{NDCG}(f_0, S_n)$ and $\mathrm{NDCG}(f_1, S_n)$ converge to different limits. So we only need to prove
the theorem assuming that $\int_0^1 \Delta y(s)(1 - s)^{-\beta}ds = 0$ and the second condition holds. The proof
is similar to that of Theorem 2, using the pseudo-expectation. We have the next two lemmas for the
discount $D(r) = r^{-\beta}$, $\beta \in (0, 1)$.

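Lemma 6. Let $D(r) = r^{-\beta}$, $\beta \in (0,1)$. Suppose that $y^{f_0}(s)$ and $y^{f_1}(s)$ are continuous. Also assume
that $\int_0^1 \Delta y(s)(1 - s)^{-\beta}ds = 0$ and $\Delta y(1) \neq 0$. Then we have

$$\left|N_D^{f_0}(n) - N_D^{f_1}(n)\right| \ge \frac{|\Delta y(1)|}{2p^{1-\beta}}\, n^{-(1-\beta)}. \qquad (41)$$

Proof.

$$N_D^{f_0}(n) - N_D^{f_1}(n) = \frac{n}{F(np)}\int_{1/n}^1 \Delta y(1 - s)\cdot(ns)^{-\beta}\,ds = \frac{n^{1-\beta}}{F(np)}\int_{1/n}^1 \Delta y(1 - s)\cdot s^{-\beta}\,ds = -\frac{1-\beta}{p^{1-\beta}}\int_0^{1/n} \Delta y(1 - s)\cdot s^{-\beta}\,ds.$$

Since $\Delta y$ is continuous, for any $\delta > 0$ there exists $\epsilon > 0$ such that for all $x \in [1 - \epsilon, 1]$,
$|\Delta y(x) - \Delta y(1)| < \delta$. Consequently, for sufficiently large $n$,

$$\left|\int_0^{1/n} \Delta y(1 - s)\cdot s^{-\beta}\,ds - \Delta y(1)\cdot\int_0^{1/n} s^{-\beta}\,ds\right| \le \delta\cdot\int_0^{1/n} s^{-\beta}\,ds.$$

Letting $\delta = |\Delta y(1)|/2$, we then have

$$\left|\int_0^{1/n} \Delta y(1 - s)\cdot s^{-\beta}\,ds - \frac{\Delta y(1)}{1-\beta}\,n^{-(1-\beta)}\right| \le \frac{|\Delta y(1)|}{2(1-\beta)}\,n^{-(1-\beta)}.$$

The lemma follows.

□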



Lemma 7. Let $D(r) = r^{-\beta}$, $\beta \in (0, 1)$. Assume that $p = \Pr(Y = 1) > 0$. If the ranking function
$f$ satisfies that $y^f(s) = \Pr(Y = 1 \mid f(X) = s)$ is Hölder continuous with constants $\alpha > 0$ and $C > 0$,
that is, $|y^f(s) - y^f(s')| \le C|s - s'|^{\alpha}$ for all $s, s' \in [0,1]$, then

$$\Pr\left[\left|\mathrm{NDCG}_D(f, S_n) - N_D^f(n)\right| > 5Cp^{-1} n^{-\min(\alpha/3,\,1)}\right] \le o\left(e^{-n^{(1-\beta)/3}}\right).$$

Proof. The proof is almost the same as the proof of Lemma 4. □

The theorem follows immediately from Lemma 6 and Lemma 7. □